achieves high efficiency in extracting clauses from large amounts of text.

1. INTRODUCTION

Advances in technology have driven a rapid increase in the volume of online information in recent
years. Unstructured knowledge must be organized before it can be exploited in various fields, and
information extraction (IE) is employed for this purpose. IE methods extract structured information such as
relations, objects, entities, and events from unstructured data. The volume of structured, semi-structured, and
unstructured data, together with the speed at which it grows, has created a need for new techniques that can
handle information of this size and type [1]. Several studies address information extraction for the Arabic
language, including entity extraction and relation extraction. However, the extracted information suffers from
drawbacks such as uninformative, incoherent, and overly specific relations. These defects stem from the rich
morphology of Arabic: each word may consist of one or more prefixes, a stem or root, and one or more
suffixes [2].

Arabic natural language processing (NLP), and information extraction in particular, faces further
challenges. Unlike languages such as English, in which capital letters help to recognize named entities,
Arabic has no capitalization. Over the last few years, researchers have proposed and developed various systems
and techniques to overcome these obstacles [1]. Arabic is also challenging because of its inflectional,
derivational, and syntactic structures. Arabic sentences follow several grammatical structures, mainly the
nominal sentence, composed of subject-verb-object (SVO), and the verbal sentence, composed of
verb-subject-object (VSO) [3]. In addition to the complexity of the Arabic language, differing representations,
semantic interpretation, and the heterogeneity of data types are intrinsic properties of massive collections of
raw data, and they are the main challenges for large-scale data analysis. To perform big data analysis, the raw
data must first be prepared and transformed into a form suitable for analysis. The IE process must therefore be
efficient enough to handle heterogeneity, dimensionality, and data diversity [4].

Despite recent progress in IE, extracting information from the web presents several challenges for
existing systems. Web data is massive and heterogeneous, the relations of interest are not easy to predict,
and the number of relations can be huge [5]. One of the most significant challenges for parsers is robustness,
the ability to analyze any input. These drawbacks motivate the use of open information extraction (OIE) to
facilitate the discovery of relations in large-scale, heterogeneous corpora. The relation corpora extracted by
OIE systems are valuable resources for downstream tasks such as automated knowledge base construction, open
question answering, event schema induction, and the generation of inference rules, as well as for improving
OIE systems themselves [6]. An Arabic OIE system is therefore designed to overcome the challenges identified
for big Arabic data and the limitations of existing IE techniques. Accordingly, this paper proposes a novel
framework, called Arabic open information extraction (AOIE), to identify relation tuples in Arabic web text.
For every relation, the grammatical functions of its constituents determine the corresponding clause type.
The system uses a heterogeneous corpus in the CoNLL-U file format produced by the UDPipe application [7], [8].

The proposed system is expected to improve information extraction for the Arabic language by
providing relation tuples. The system has been evaluated in terms of precision, recall, and F-measure. The
results show good performance: precision reaches 91%, recall reaches 84%, and the F-measure is 87%. The
system has also been applied in several fields: weather, social, sport, health, biomedical, and economic. The
overall precision for each field consecutively is 91%, 80%, 81.8%, 91%, and 88%. The evaluation covers four
levels of sentence complexity: simple, complex, highly complex, and extremely complex. This article is
organized as follows: section 2 presents related work, section 3 explains the methodology framework, section 4
describes the proposed system, section 5 presents the results and evaluation, and the article ends with the
conclusion and future work.


2. RELATED WORK 

Current IE systems focus on analyzing the local context within individual sentences to extract entities
and their relationships in a specific field, while ignoring redundant information that can be used
collectively [9]. Compared with other languages, efforts on Arabic information extraction are scarce. This can
be partly attributed to the complexity of Arabic, which makes it difficult to extract relations
automatically [10]. Arabic named entity recognition (NER), as a basis for relation extraction applications,
accounts for a significant share of the research in this field. Mesmia et al. [9] proposed a system for Arabic
NER based on two transducers, one for analysis and one for synthesis. Darwish and Gao presented simple,
effective, and language-independent approaches for improving NER in microblogs, using Arabic as an
example [11]. Sabty et al. [12] proposed a system for Arabic NER based on word embeddings. A word embedding is
a representation of text in which words with similar meanings have similar representations.

For relation extraction, El-Salam et al. [13] extracted binary relations between two Arabic named
entities in a specific domain from the web using a semi-supervised technique. Fasha et al. proposed an
information extraction model for Arabic text in relatively open text domains. The model has two phases: the
first extracts relations based on part-of-speech (POS) tagging, and the second uses description logic to
extract implicit knowledge [14]. Open IE, by contrast, provides a compact representation of the data compared
with search snippets or the original document, while still retaining the important information. This can help
an end user obtain a summary view of a concept [10]. Previous work in OIE is divided into two generations
according to the underlying model. The first generation is training-data-based OIE, which generates patterns
from training data represented by dependency trees or POS-tagged text [15]. This generation has two methods.
The first uses training data and shallow syntax to learn extractors or estimate the confidence of those
systems, which rely on extensive human involvement [10]; examples are TextRunner [16] and ReVerb [17]. The
second uses training data and dependency parsing: it performs POS tagging, syntactic chunking, and dependency
parsing and returns a set of relation triples, as in the OLLIE model [18]. The second generation is rule-based
OIE, which relies on hand-crafted patterns over POS-tagged text or rules operating on dependency parse
trees [15]. This generation also has two methods. The first is rule-based with shallow syntax and extracts
relations using a simple constraint: every relation phrase is a verb, a verb followed by a preposition, or a
verb followed by nouns, adjectives, or adverbs, as in the ReVerb model [15]. The second is rule-based with
dependency parsing and uses hand-crafted heuristics operating on dependency parses, as in the ClausIE
model [19].

Although most OIE research targets the English language, several studies focus on other languages.
For example, [20] presents the German OIE system (GerIE), which relies on hand-crafted rules operating on
dependency-parsed sentences. Jia et al. [21] proposed an unsupervised model that extracts open entity
relations and addresses linguistic difficulties specific to Chinese; it uses dependency semantic normal forms
to extract entity relation triples. Truong et al. proposed an OIE method for Vietnamese, named vnOIE, which
uses a clause-based approach to generate open relations and their arguments from Vietnamese text. The model
formulates Vietnamese dependency parsing so that all possible relations in a sentence are considered using
grammatical clauses [22]. Niklaus et al. identified three significant challenges for Open IE systems. The
first is automation: Open IE systems should apply unsupervised extraction strategies, automatically detect
possible relations of interest in a single pass over the corpus, and automatically generate the relevant
training data. The second is corpus heterogeneity, which hinders syntactic and dependency parsers. The last is
efficiency: Open IE systems are effective only if they scale to large amounts of text in various domains [10].

Previous Arabic information extraction research and applications suffer from a high ratio of
incoherent output. Most of them target a specific domain with supervised methods, so the methods cannot be
reused for other purposes. Previous research also focuses only on binary relations and named entity
recognition, and little work addresses the extraction of coherent information from Arabic text. OIE practice
in other languages sheds light on the most suitable method for Arabic, in particular vnOIE, which generates
open relations from the Vietnamese language and demonstrates the effectiveness of dependency parsing for
languages with complex morphology. Accordingly, the system proposed in this paper works on a heterogeneous,
dependency-parsed corpus in the CoNLL-U file format and yields more relation types. The next section provides
more detail about the proposed system.


3. THE METHODOLOGY FRAMEWORK 

In view of the challenges faced by Arabic IE and the review of OIE systems developed for other
languages, this research proposes an information extraction system called Arabic open information extraction
(AOIE). The system extracts relation tuples representing the essential clauses or assertions in the text. The
research framework used to build an effective Arabic OIE system is shown in Figure 1. The objective of the
system is to yield Arabic clauses that carry information that is as coherent as possible. The complex
morphology of the Arabic language is the main obstacle to such extraction. Accordingly, Arabic features such as
tokenization and part of speech (POS) have been used. The research framework consists of four stages, as
follows.

The first stage uses the Arabic dependency parsing database developed by Mohamed et al. [8]. This
corpus contains an index of the sentences and their linguistic metadata to enable quick mining and search
across the corpus. The dependency relations in this corpus have seventeen morphological annotations and eight
features; they are obtained by identifying the textual structures and then recognizing their grammatical
characteristics. Parsing and dependency analysis were performed by the universal dependency system and
corrected manually [8].

The second stage builds the Arabic OIE system by determining the initial clause types. These clauses
are grammatical units derived from the dependency parse, for example VS ("verb", "subject"), VSO ("verb",
"subject", "object"), VSOA ("verb", "subject", "object", "adjective"), VO ("verb", "object"), VOA ("verb",
"object", "adjective"), and VA ("verb", "adjective"). The system is implemented in the Python programming
language, using natural language processing libraries to handle the Arabic dependency parsing features
included in the CoNLL-U format.

The third stage analyzes and evaluates the results by checking the validity of each clause. The
feedback from this step is the input to the next step, which improves the system accuracy by manually refining
the sequences and rules of the system algorithm. For example, in a verbal sentence, if universal part of
speech tagging (UPosTag) = "VERB" and "Head" = "0", then Root = Verb, return (V); and if "Head" = Root "ID" and
universal dependency relation (DepRel) = "nsubj", return (S). In a nominal sentence, if the "DepRel" value is
"nsubj", the word is set as the inchoative (I), and if the "DepRel" value is "nmod" or "amod", it is set as the
root and predicate (E). In the final stage, the system reaches a satisfactory form in several domains, as
confirmed by the measurement standards of precision, recall, and F-measure. AOIE therefore aims to address
three problems: reliance on supervised extraction strategies; transportability, as the system is intended for
domain-independent usage; and efficiency, so that it scales readily to large amounts of text. Every year the
volume of unstructured data doubles [4], and the extraction of semantic information from such a huge amount of
unstructured data becomes more important.
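
The rules quoted in this stage can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the token dictionaries (keys `id`, `form`, `upos`, `head`, `deprel`), the function names, and the English placeholder word forms are assumptions introduced here, and the toy values are chosen only to trigger each rule.

```python
# Illustrative sketch of the stage-three rules; all names are hypothetical.

def find_verbal_elements(tokens):
    """Return the verb (V) and subject (S) of a verbal sentence, or None."""
    # Rule: UPosTag == "VERB" and Head == 0 marks the root verb (V).
    root = next(
        (t for t in tokens if t["upos"] == "VERB" and t["head"] == 0), None
    )
    if root is None:
        return None
    # Rule: Head == root ID and DepRel == "nsubj" marks the subject (S).
    subject = next(
        (t for t in tokens
         if t["head"] == root["id"] and t["deprel"] == "nsubj"),
        None,
    )
    return {"V": root["form"], "S": subject["form"] if subject else None}


def find_nominal_elements(tokens):
    """Return the inchoative (I) and predicate (E) of a nominal sentence."""
    # Rule: DepRel == "nsubj" marks the inchoative (I).
    inchoative = next((t for t in tokens if t["deprel"] == "nsubj"), None)
    # Rule: DepRel == "nmod" or "amod" marks the predicate (E).
    predicate = next(
        (t for t in tokens if t["deprel"] in ("nmod", "amod")), None
    )
    return {
        "I": inchoative["form"] if inchoative else None,
        "E": predicate["form"] if predicate else None,
    }


# Hypothetical tokens with English placeholder forms.
verbal = [
    {"id": 1, "form": "cause", "upos": "VERB", "head": 0, "deprel": "root"},
    {"id": 2, "form": "viruses", "upos": "NOUN", "head": 1, "deprel": "nsubj"},
    {"id": 3, "form": "disease", "upos": "NOUN", "head": 1, "deprel": "obj"},
]
nominal = [
    {"id": 1, "form": "barometer", "upos": "NOUN", "head": 2, "deprel": "nsubj"},
    {"id": 2, "form": "device", "upos": "NOUN", "head": 0, "deprel": "nmod"},
]

print(find_verbal_elements(verbal))
print(find_nominal_elements(nominal))
```

Returning `None` when no root verb exists lets a caller fall back to the nominal rules, which mirrors the verbal-then-nominal order described in the text.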


[Figure 1 flowchart. First stage: Arabic dependency parsing based corpora. Second stage: build Arabic OIE
model. Third stage: analyze and evaluate the output result; the feedback is used to improve the model
accuracy. Final stage: formulate the proposed model.]

Figure 1. The research framework


3.1. Arabic dependency parsing 

Arabic grammarians analyze all Arabic words into three main parts of speech, which are further
differentiated into finer categories that together cover the whole of the Arabic language. These parts are:
i) noun: a name or a word that describes a person, thing, or idea; ii) verb: the most crucial word in the
sentence; the Arabic verb is classified into perfect, imperfect, and command, and can be further classified by
gender and number; and iii) particle: this includes prepositions, adverbs, and conjunctions [23]. The
part-of-speech (POS) tag is an important feature used in open relation extraction. The Universal Dependencies
framework is used to match different types of dependency relations across languages. Parsers trained on
Arabic-padt-ud-2.4-data provide seventeen dependency relation types, including subject and object, relative,
adverbial, and adnominal clauses, conjunction, auxiliary, and parataxis [24]. The natural language toolkit
(NLTK) is an open-source suite of libraries and programs that can be integrated into the Python environment
and used to perform statistical and rule-based natural language processing tasks such as POS tagging and
parsing [25]. NLTK is applied top-down, as shown in Figure 2, to parse the following Arabic sentence, given
here in English translation, and produce the corresponding dependency-parsing tree:


"Coronaviruses are a broad spectrum of viruses that may cause disease in animals and humans"

The resulting tree shows the part-of-speech (POS) tag of each token, distinguishing the verbs from the nouns
of the sentence.

Dependency trees are suitable for various languages [26]. A multilingual parser with a common output
tag set has been used to represent the syntactic structure of a sentence. Universal dependency relations have
several types, as shown in Table 1. The tree usually begins with the verb, which is the root of the sentence
and is connected to the nouns, while adjectives and adverbials attach to these nouns. A well-formed dependency
tree for an input sentence is simply a tree whose nodes map one to one to the tokens produced by
morphological analysis and tokenization, and whose roots group the nodes according to the division into
sentences or paragraphs. The conversion of these trees was the easiest task, as the linguistic representation
was already what we needed [27]. Since CoNLL-U promoted multilingual dependency parsing and provided
resources for it, much progress has been made in this area, and the number of freely available dependency
parsers has increased.
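
As a rough illustration of the tree structure described here, the sketch below rebuilds a dependency tree from head indices and prints it with the root first. It uses only the Python standard library rather than NLTK, and the token dictionaries, function names, and English placeholder forms are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical tokens: "head" is the ID of the governing token, 0 for the root.
tokens = [
    {"id": 1, "form": "cause", "head": 0, "deprel": "root"},
    {"id": 2, "form": "viruses", "head": 1, "deprel": "nsubj"},
    {"id": 3, "form": "disease", "head": 1, "deprel": "obj"},
]


def build_tree(tokens):
    """Group tokens by the ID of their head so children can be looked up."""
    children = defaultdict(list)
    for tok in tokens:
        children[tok["head"]].append(tok)
    return children


def render(children, head=0, depth=0):
    """Return indented lines, one per token, starting from the root (head 0)."""
    lines = []
    for tok in children.get(head, []):
        lines.append("  " * depth + f'{tok["form"]} ({tok["deprel"]})')
        lines.extend(render(children, tok["id"], depth + 1))
    return lines


print("\n".join(render(build_tree(tokens))))
```

Each token appears exactly once in the output, which reflects the one-to-one mapping between tree nodes and tokens.
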
Figure 2. Dependency parsing structure of Arabic sentence


Table 1. Dependency parsing relation

Rel         Definition
Root        Points to the root of the sentence
Parataxis   Used to connect two sentences together
Advmod      Adverbial modifier
Nsubj       A nominal subject
Obl         Adverbial attaching to a verb, adjective
Aux         Auxiliary
Amod        Adjectival modifier
Case        Providing a more uniform analysis of nominal elements
Conj        Conjunct
Obj         Direct object
Nmod        Nominal modifier
Xcomp       Open clausal complement
Fixed       Certain fixed grammaticized expressions
Det         Determiner
Comp        Comparison constructions
Cc          Coordinating conjunction
Acl         Adnominal clause (clausal modifier of a noun)


3.2. Arabic dependency corpora 

In this study, we conducted experiments on the Arabic dependency parsing-based corpora for
information extraction, which are provided in CoNLL-U format as dependency parsing (DP) input [8]. The corpus
is built from web text and covers several fields: weather, economic, social, sport, health, and biomedical.
Table 2 shows a sample of the corpora generated by the UDPipe model. UDPipe produces CoNLL-U format files by
performing tokenization, morphological analysis, POS tagging, lemmatization, and dependency parsing for nearly
all Universal Dependencies 2.5 treebanks [28], [29]. UDPipe was developed at Charles University in
Prague [28], [30]. The UDPipe output contains universal POS tags (UPOS), language-specific POS tags (XPOS), a
universal subset of morphological features (UFeats), lemmas, the universal dependency relation (DepRel), and
the head (root if Head = 0) [28], [31]. The following section introduces the proposed system of Arabic open
information extraction.
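
To make the CoNLL-U fields concrete, the following sketch reads the ten tab-separated CoNLL-U columns into token dictionaries. It is a simplified reader written for illustration, not the code used by the authors: the function name, the field keys, and the English placeholder sample are assumptions, and multiword-token ranges and empty nodes are simply skipped.

```python
CONLLU_FIELDS = (
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
)


def parse_conllu(text):
    """Split CoNLL-U text into sentences, each a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # comment lines such as "# sent_id = 1"
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword ranges (1-2) and empty nodes (1.1)
        tok = dict(zip(CONLLU_FIELDS, cols))
        tok["id"] = int(tok["id"])
        tok["head"] = int(tok["head"]) if tok["head"].isdigit() else None
        # FEATS is "_" when empty, otherwise "Key=Value" pairs joined by "|".
        tok["feats"] = (
            {} if tok["feats"] == "_"
            else dict(kv.split("=", 1) for kv in tok["feats"].split("|"))
        )
        current.append(tok)
    if current:
        sentences.append(current)
    return sentences


# Hypothetical two-token sample with English placeholder forms.
SAMPLE = (
    "# sent_id = 1\n"
    "1\tcauses\tcause\tVERB\t_\tAspect=Imp|Voice=Act\t0\troot\t_\t_\n"
    "2\tviruses\tvirus\tNOUN\t_\tNumber=Plur\t1\tnsubj\t_\t_\n"
    "\n"
)

for token in parse_conllu(SAMPLE)[0]:
    print(token["id"], token["form"], token["upos"], token["head"], token["deprel"])
```

Keeping `head` as an integer makes the "root if Head = 0" check a direct comparison.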


Table 2. Sample from the Arabic dependency parsing-based corpora for information extraction 
Id Form Lemma UPosTag  XPosTag Feats Head DepRel 
# newdoc 
# newpar 
# sent_id = 1 
# text =Y] lall a pall ad ab call La g pall Gye Analy AID Lig gS Sahn y ped puted 
Aspect=Imp|Gender=Fem|Mood=Ind|Number=Sing|Person=3|Ver 


1 5 dtc! VERB  VIIP-3FS-- bForm=Fin|Voice=Pass 0 root 
2 Chey ys hs; NOUN  N------ IR Case=Nom|Definite=Cons|Number=Plur 1 nsubj 
3 Us Lys _ 2 nmod 
4 ‘ ‘ - 3 punct 
5 4D aLa Case=Nom|Definite=Ind|Number=Sing 1 obj 
6 ily els Case=Nom|Definite=Ind|Gender=Fem|Number=Sing 5 amod 
7 uo os AdpType=Prep 8 case 
8 ches pill Guy Case=Gen|Definite=Def|Number=Plur 5 nmod 
9 Äl ill Case=Gen|Gender=Fem|Number=Sing|PronType=Rel 11 nsubj 
10 ï a n 11 advmod:emph 
ll oai ok Aspect=Imp|Gender=Fem|Mood=Ind|Number=Sing|Person=3|Ver 8 sei 
ue E bForm=Fin|Voice=Act 

12 oad Use Case=Acc|Definite=Def|Number=Sing 11 obj 
13 dos olya X X--------- Foreign=Yes 14 Nmod 
14 Yl aly X U--------- = 12 Nmod 
15 : 3 PUNCT G--------- _ 1 Punct 
# text = sal) haal ole Gell leall sa pies Ll Skee 

1 je J> NOUN N------ SIR Case=Nom|Definite=Cons|Number=Sing 4 nsubj 
2 jd jad IDAFA  U--------- - 1 nmod 
3 E E PRON SP---MS1- Case=Nom|Gender=Masc|Number=Sing|Person=3|PronType=Prs 4 nmod 
4 J J NOUN N------ SID Case=Nom|Definite=DefjNumber=Sing 0 root 
5 alah Gal ADJ A-----S1D_ Case=Nom|Definite=Def|Gender=Masc|Number=Sing 4 amod 
6 oe Ut: NOUN N------ SIR Case=Nom|Definite=Cons|Number=Sing 4 nmod 
7 baai =i NOUN N------S2D Case=Gen|Definite=Def|Number=Sing 6 nmod 
8 œl 55 NOUN A-----S2D Case=Gen|Definite=Def|Gender=Masc|Number=Sing 7 amod 
9 3 PUNCT _ G--------- - 4 punct 

4. THE PROPOSED SYSTEM

Open information extraction (OIE) is an unsupervised task that extracts coherent information from
text. Its output represents the basic clauses or assertions in the text. Clauses can be defined as coherent
pieces of basic information that are not over-specified. An Arabic clause is a grammatical unit: a part of a
sentence that expresses some coherent information [21], [22], [32].

Several stages and steps were followed to build, test, and improve the proposed system, as shown in
Figure 3, and the Arabic dependency database developed by Mohamed et al. [8] was used, as shown in the
algorithm for identifying clause types. This database is an Arabic dependency parsing-based corpus that
includes grammatical features such as universal part of speech tagging (UPosTag), Head, and DepRel. First, the
initial clause types were determined according to the elements of the Arabic sentence. Then the grammatical
rules for identifying each element of a clause were determined from the initial clause types and the
dependency features present in the CoNLL-U files. The system was then implemented in the Python programming
language using several open-source packages. The source code of the proposed system is available on
GitHub [33].

Initially, every sentence in the corpora is separated. The algorithm for identifying clause types
gives the sequence for detecting each clause in a sentence and identifying its type. For verbal clauses, based
on the initial clause types and the dependency features present in the CoNLL-U files, the system tries to find
the verb (V) in the clause. If the verb is found, it is set as the root. To identify the clause parts, the
system determines the IDs of the clause words and then finds the related elements: subject (S), object (O),
and adverb (A). Depending on the elements found, the clause type is determined as VS, VSO, VO, VOA, VA, or SV.
If no verb is found, the clause is recognized as a nominal sentence: the first word (noun) is set as the
inchoative (I), and the algorithm tries to find the predicate (E) using the grammatical rules, namely the noun
that follows the inchoative in the sentence. The clause type is then determined as (IE).


[Figure 3 flowchart. Using the Arabic dependency parsing based corpora: determine the initial clause types;
divide every sentence into a separate sheet; determine the rules to find clause elements; recognize the
clauses by finding the verb. If the verb is found: root = the verb and, for each clause, determine the clause
type (VS, VSO, VSA, ...). If the verb is not found: inchoative (I) = root, find predicates (E), and the clause
type is (IE). Then produce the clauses, check the validity of the extracted clauses, calculate the model
accuracy (precision, recall, F-measure), and improve the model by modifying the grammatical rules.]

Figure 3. The proposed system building steps

Algorithm: Identifying clause types

Input: DP from a sentence
Output: Set of clauses and their types
Do
    Identify all clauses in the sentence       // recognize each clause in the sentence
    Root, S, V, O, A, I, E = Null              // S is Subject, V is Verb "Root",
                                               // O is Object, A is Adjective,
                                               // I is inchoative, E is the predicate
    For every word in clause
        // Find clause root
        If UPosTag == verb and Head == 0:      // verbal clauses
            Root = (V)                         // (Root_ID)
            // Assign clause parts
            Find (S)
            Find (O)
            Find (A)
            If (S, O and A are found) then
                Clause Type ← VSOA
            Else if (S and O are found) then
                Clause Type ← VSO
            Else if (S and A are found) then
                Clause Type ← VSA
            Else if (S is found) then
                Clause Type ← VS
            Else if (A is found) then
                Clause Type ← VA
            End if
        Else if UPosTag != verb:               // a nominal clause
            Root = (I)
            Find (E)
            Clause Type ← IE
        End if
        Split clause
        Save clause type and clause tuple
    End For
End Do
Save sorted clause elements and types in Excel sheet
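
The decision logic of the algorithm can be sketched in Python as shown below. This is a simplified illustration under stated assumptions, not the published source code: the token dictionaries, the `tok` helper, and the function name are hypothetical; only direct dependents of the root verb are inspected; and the DepRel sets follow the values named in the worked example of this section ("nsubj" for S, "obj" or "obl:org" for O, "case" or "obl" for A). The label is built by concatenation, so the sketch also produces VO and VOA, which the text lists among the clause types, and a bare "V" when no element is found.

```python
# Hypothetical DepRel groupings taken from the worked example in the text.
SUBJECT_RELS = {"nsubj"}
OBJECT_RELS = {"obj", "obl:org"}
A_RELS = {"obl", "case"}


def tok(tok_id, upos, head, deprel):
    """Build a minimal token dict (illustrative helper)."""
    return {"id": tok_id, "upos": upos, "head": head, "deprel": deprel}


def classify_clause(tokens):
    """Return a clause type label such as "VSO", "VA", or "IE"."""
    # A VERB token whose head is 0 is the root of a verbal clause.
    root = next(
        (t for t in tokens if t["upos"] == "VERB" and t["head"] == 0), None
    )
    if root is None:
        return "IE"  # no root verb: treated as a nominal clause
    # Collect the relations of the tokens attached directly to the root.
    rels = {t["deprel"] for t in tokens if t["head"] == root["id"]}
    label = "V"
    if rels & SUBJECT_RELS:
        label += "S"
    if rels & OBJECT_RELS:
        label += "O"
    if rels & A_RELS:
        label += "A"
    return label


clause = [
    tok(1, "VERB", 0, "root"),
    tok(2, "NOUN", 1, "nsubj"),
    tok(3, "NOUN", 1, "obj"),
    tok(4, "NOUN", 1, "obl"),
]
print(classify_clause(clause))
```

Building the label from the set of relations present avoids the chain of nested conditions in the pseudocode while giving the same result for the VSOA, VSO, VSA, VS, and VA cases.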


After the clause type is determined, the clause is split off, and the same process is repeated for
all clauses in the sentence and then for all separated sentences. The result is stored in an Excel sheet
containing each clause element and its type. The validity of the extracted clauses is then checked manually,
and the system accuracy is calculated using precision, recall, and F-measure. The grammatical extraction rules
were modified and the system was retested to improve the results until the final results were achieved. For
example, the sentence shown in Figure 2, given here in English translation, has been processed as follows:


"Coronaviruses are a broad spectrum of viruses that may cause disease in animals and humans"


The CoNLL-U format file for this sentence was used in the extraction process shown in Table 3. First,
the system looks for the verb (V) by finding the token whose "UPosTag" value is "VERB". If the "Head" of this
verb is "0", the verb is taken as the root of the clause. The system then finds the noun that is linked to the
root and whose "DepRel" value is "nsubj" and takes it as the subject (S). Next, the system looks for the words
that are linked to the root and whose "DepRel" value is "obj" or "obl:org" and takes them as the object (O),
and for the words whose "DepRel" value is "case" or "obl" and takes them as the element (A). The first clause
in the sentence is then complete and its type is set to (VSOA). Finally, the clause is stored in the results
as follows:

S = Coronavirus, V = is a strain, A = of virus


The same process is applied to the rest of the sentence, and the result for the second clause in the
sentence is:

S = viruses that may, V = cause, O = disease


Table 3 shows the extracted clauses, with their types and elements, for the rest of the paragraph.
Another example sentence, in English translation, is "Barometer is a device for measuring atmospheric
pressure." The CoNLL-U format file for this sentence was used in the extraction process shown in Table 2.
First, the system finds the first noun of the sentence, whose "UPosTag" value is "NOUN", and checks whether its
"DepRel" value is "nsubj"; if so, it is set as the inchoative (I), together with all the words linked to it.
The system then takes the following noun as the predicate (E), together with the words linked to it. The
clause items are as follows:

I = thermometer device, E = is a device specially designed to measure the temperature.

For the rest of the text, Table 3 shows the extracted clauses, with their types and elements.


Table 3. Example for analyzing Arabic sentence that illustrated verbal sentences clauses

Clause types listed: VSOA, VSO, VSOA, VSO, VA, IE, IE

Text 1: "Corona viruses are a broad spectrum of viruses that may cause disease in animals and humans. A
number of coronaviruses are known to cause respiratory diseases, the strength of which ranges from common
colds to more severe diseases such as Middle East Respiratory Syndrome (MERS) and severe acute respiratory
syndrome (SARS). Coronavirus, recently discovered, causes Covid-19 disease."

  Sentence: "Coronaviruses are a broad spectrum of viruses that may cause disease in animals and humans."
  Derived clauses:
    S=Coronavirus, V=is, A=a strain of viruses
    S=viruses that may, V=cause, O=disease

  Sentence: "It is known that a number of coronaviruses cause respiratory diseases, the strength of which
  ranges from common colds to more severe diseases such as the Middle East Respiratory Syndrome (MERS) and
  severe acute respiratory syndrome (SARS)."
  Derived clauses:
    V=caused, S=a number of, O=respiratory diseases
    S=strength, V=ranges, A=to more diseases

  Sentence: "Coronavirus, recently discovered, causes Covid-19 disease"
  Derived clauses:
    S=Corona virus, O=discovered, V=causes, A=Covid-19 disease

Text 2: "A number of devices are used to monitor the type of climate prevailing in and from an area. A
thermometer is a device specially designed to measure the temperature. Barometer is a device for measuring
air pressure."

  Sentence: "A number of devices are used to monitor the type of climate prevailing in and from an area. A
  thermometer is a device specially designed to measure the temperature."
  Derived clauses:
    V=used, A=from an area
    I=thermometer device, E=is a device specially designed to measure the temperature.

  Sentence: "Barometer is a device for measuring atmospheric pressure"
  Derived clauses:
    I=thermometer device, E=atmospheric pressure measuring device


4.1. Evaluation results and discussion 


To evaluate the system's extractions, we consider two aspects of each extracted fact: i) whether it
is coherent or incoherent and ii) whether it is minimal or not. A fact is considered coherent if it keeps the
same meaning as in the original sentence. A coherent fact may contain another fact in its arguments. The
following error types are distinguished: wrong boundaries, where the relational or argument phrase is either
too long or too short; redundant extraction, where the extracted proposition is already expressed in another
extraction; uninformative extraction, where important information is omitted; missing extraction (false
negative), where an existing relationship is not extracted; and wrong extraction, where there is no meaningful
interpretation of the proposition. Finally, the quality and efficiency of the system are measured by the
precision, recall, and F-measure, with the precision given by (1):


$$\text{Precision} = \frac{\text{True positives}}{\text{True positives} + \text{False positives}} \tag{1}$$


Equation (1) presents the precision, which is the number of correctly predicted items as a fraction of all
items the system identified. The system's precision has been measured as the ratio of coherent facts to all
extracted facts. We use this ratio to measure the overall precision. Equation (2) presents the recall, which
measures how much relevant information the system has extracted. The recall is the number of correctly
predicted items as a fraction of the total number of correct items for a given topic. Equation (3) presents
the F-measure, which evaluates the overall performance of the system. These measures have been computed for
different fields based on several categories of sentences: simple, complex, highly complex, and extremely
complex [34]. These categories are determined by the number of clauses in the sentence. Table 4 shows
the results, giving each category's ratio and precision in every field [10], [35].


$$\text{Recall} = \frac{\text{True positives}}{\text{True positives} + \text{False negatives}} \tag{2}$$

$$\text{F-measure} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3}$$
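
The three measures in (1)-(3) can be sketched in a few lines of Python. This is only an illustration: the function name `precision_recall_f` and the counts in the usage line are hypothetical and are not taken from the experiments reported here. The F-measure is assumed to be the usual harmonic mean of precision and recall.

```python
from __future__ import annotations


def precision_recall_f(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F-measure) from raw counts.

    tp: correct (coherent) extractions
    fp: incorrect extractions
    fn: facts present in the text that were not extracted
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical counts, chosen only to show the call.
p, r, f = precision_recall_f(tp=8, fp=2, fn=2)
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

Returning zero when a denominator is empty is a design choice of this sketch; an evaluation script could equally treat that case as undefined.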


The complexity of sentence structure affects the number of extracted relations and the system's precision.
Nevertheless, AOIE achieves high efficiency in determining the relationship between subjects, objects, and
verbs based on DP analysis, and most of the extracted clauses are correct. A simple sentence produces one
clause and a complex sentence 2-3 clauses. A highly complex sentence generates 4-5 clauses, while an
extremely complex sentence may produce more than five clauses. Table 5 illustrates the system's efficiency
across different fields of Arabic text by presenting the overall precision, recall, and F-measure for each
field. Although research on Arabic open information extraction is limited, the binary relation extraction
model developed by [13] was chosen for comparison with AOIE. AOIE shows higher accuracy: the binary
relations model's precision ranges from 0.61 to 0.75 and its recall from 0.71 to 0.83, whereas AOIE reaches
a precision of 0.91 and a recall of 0.84, as shown in Table 6.
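
The mapping from clause count to sentence category described above can be written as a small function. This is a minimal sketch: the function name `sentence_category` is hypothetical, and the thresholds are simply the ones stated in the text (1 clause, 2-3, 4-5, more than five).

```python
def sentence_category(num_clauses: int) -> str:
    """Map the number of extracted clauses to a sentence category."""
    if num_clauses < 1:
        raise ValueError("a sentence must yield at least one clause")
    if num_clauses == 1:
        return "simple"
    if num_clauses <= 3:
        return "complex"
    if num_clauses <= 5:
        return "highly complex"
    return "extremely complex"


for n in (1, 3, 5, 7):
    print(n, sentence_category(n))
```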


Table 4. Experiment results in different sentence structure cases 


Field               Category of sentence   Ratio   Precision
Sport & Health      Simple                 45%     80%
                    Complex                40%     80%
                    Highly Complex         10%     82%
                    Extremely Complex      5%      90%
Weather             Simple                 50%     95%
                    Complex                30%     84%
                    Highly Complex         10%     82%
                    Extremely Complex      10%     80%
Economic & Social   Simple                 65%     89%
                    Complex                10%     90%
                    Highly Complex         15%     85%
                    Extremely Complex      10%     88%
Economic            Simple                 20%     94%
                    Complex                25%     719%
                    Highly Complex         30%     85%
                    Extremely Complex      25%     90%
Biomedical          Simple                 50%     95%
                    Complex                30%     84%
                    Highly Complex         10%     82%
                    Extremely Complex      10%     80%
News                Simple                 40%     90%
                    Complex                20%     50%
                    Highly Complex         25%     72%
                    Extremely Complex      15%     42%


Table 5. Experiment results for different fields


Field               Precision   Recall   F-Measure
Weather             91%         84%      87%
Economic & Social   81%         68%      73%
Sport & Health      81.8%       64%      71%
Biomedical          84%         69%      75%
Economic            88%         79%      83%
News                67%         50%      57%
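
As a quick consistency check on Table 5, and assuming the F-measure is the harmonic mean of precision and recall, the Weather row can be worked through by hand:

$$\text{F-measure} = \frac{2 \times 0.91 \times 0.84}{0.91 + 0.84} = \frac{1.5288}{1.75} \approx 0.874$$

This agrees with the 87% reported for the Weather field.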


Table 6. Comparison between the previous model and the proposed system 


Comparison item   Binary relations model   AOIE
Precision         75%                      91%
Recall            83%                      84%
F-measure         76%                      87%


To the best of our knowledge, this research presents the first comprehensive relation extraction system for
the Arabic language. AOIE is the first Arabic open information extraction system based on the Arabic
language's grammatical clauses; it is highly scalable in terms of clause extraction and domain-independent.
The system exploits DP analysis to extract relation tuples quickly based on grammatical clauses. This system
achieves a precision from 71% to 91%, a recall from 83% to 84%, and an F-measure from 76% to 87%. However,
some unexpected incorrect extractions could result from the output of Arabic parsing. First, unlike English
DP, the DP in Arabic may not detect the details of auxiliary adverbs, which could result in incorrect
extractions caused by wrongly labeled main verbs. This problem arises because AOIE uses heuristic rules to
find significant verbs in a sentence based on DP. Second, Arabic DP is limited in distinguishing between
essential adverbs and verbs. Essential verbs are required, while adverbs may or may not appear in
extracted clauses.

In some cases, the AOIE system failed to determine clauses such as VS, VSO, VO, VOA, and VA, where
S is the subject, V is the verb, O is the object, and A is an adverb. Overall, the proposed system's
implementation addresses the problems mentioned in the literature, since the system relies on
unsupervised extraction strategies and is implemented in several domains. The results also indicate that the
system achieves high efficiency in extracting clauses from large amounts of text.
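
To make the clause labels concrete, the sketch below builds a label such as VSO from whichever constituents are present. It is an illustration only and not the AOIE implementation: the function `clause_type`, the dictionary layout, and the fixed verb-subject-object-adverb ordering are assumptions made for this example, and the sample values are English glosses.

```python
# Assumed constituent order for labels: verb, subject, object, adverb.
ORDER = ("V", "S", "O", "A")


def clause_type(constituents: dict) -> str:
    """Build a label such as 'VSO' from the constituents that are present."""
    if not constituents.get("V"):
        raise ValueError("a verbal clause needs a verb (V)")
    # Keep only the slots that are filled, in the assumed order.
    return "".join(tag for tag in ORDER if constituents.get(tag))


example = {"V": "caused", "S": "a number of", "O": "respiratory diseases"}
print(clause_type(example))  # VSO
```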


5. CONCLUSION 

This system presents a solution for the Arabic resource shortage problem, where the language has a
limited number of available resources to be used in NER systems. This paper presents the first attempt to
implement an Arabic open information extraction system. The proposed system takes advantage of the
grammatical clause-based approach. By using grammar rules, the proposed system extracts all possible clauses
in a sentence. The proposed system identifies the corresponding clause type based on propositions as
extractable relations and constituents' grammatical functions. In the experiments, the system has been
evaluated using several factors such as the grammatical structures of sentences and the number of verbs in a
sentence. Also, the proposed system avoids the problem of supervised strategies by relying on
unsupervised extraction strategies. The system has been implemented in several domains to avoid restricting
information extraction to a specific field. The results indicate that the system achieves high efficiency in
extracting clauses from large amounts of text. The proposed system's output is a set of new relations that
contain the most important part of the sentence. Overall, the results are promising. AOIE could be applied to
Arabic question answering systems or integrated into higher-level Arabic NLP tasks such as text similarity or
text summarization. One remaining difficulty is the complex nature of the Arabic language: the nominal
sentence has no specific form and does not contain a verb, which makes it difficult to extract its parts. In
this regard, future research should focus on solving such problems.